Automatic Topic Identification for Large Scale Language Modeling Data Filtering

Authors

  • Lucie Skorkovská
  • Pavel Ircing
  • Aleš Pražák
  • Jan Lehečka
Abstract

The paper presents a topic identification module embedded in a complex system for acquiring and storing large volumes of text data from the Web. The module processes each acquired data item and assigns it keywords from a defined topic hierarchy, which was developed for this purpose and is also described in the paper. The quality of the topic identification is evaluated in two ways: directly, using classic precision-recall measures, and indirectly, by measuring the ASR performance of topic-specific language models built from the automatically filtered data.
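As a rough illustration of the direct evaluation mentioned in the abstract, the sketch below computes micro-averaged precision and recall for multi-label topic assignment. The abstract does not say how the authors averaged their scores, so micro-averaging is an assumption here, and the function name `micro_precision_recall`, the document IDs, and the topic labels are all hypothetical.

```python
from typing import Dict, Set, Tuple


def micro_precision_recall(
    predicted: Dict[str, Set[str]],
    reference: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over all reference documents.

    Both arguments map a document ID to the set of topics assigned to it.
    """
    true_positives = 0
    predicted_total = 0
    reference_total = 0
    for doc_id, ref_topics in reference.items():
        # A document with no prediction counts as an empty topic set.
        pred_topics = predicted.get(doc_id, set())
        true_positives += len(pred_topics & ref_topics)
        predicted_total += len(pred_topics)
        reference_total += len(ref_topics)
    # Guard against division by zero when nothing was predicted or annotated.
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / reference_total if reference_total else 0.0
    return precision, recall


# Hypothetical toy data: manual annotations versus automatically assigned topics.
reference = {
    "doc1": {"sport", "football"},
    "doc2": {"politics"},
}
predicted = {
    "doc1": {"sport", "tennis"},
    "doc2": {"politics", "economy"},
}

precision, recall = micro_precision_recall(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case two of the four predicted topics are correct and two of the three reference topics are found, so assigning more topics per document tends to raise recall at the cost of precision.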

Related resources

Application of Lemmatization and Summarization Methods in Topic Identification Module for Large Scale Language Modeling Data Filtering

The paper presents experiments with the topic identification module which is a part of a complex system for acquisition and storing large volumes of text data. The topic identification module processes each acquired data item and assigns it topics from a defined topic hierarchy. The topic hierarchy is quite extensive – it contains about 450 topics and topic categories. It can easily happen that...

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

In-Network Phased Filtering Mechanism for a Large-Scale RFID Inventory Application

RFID technology is one of the automatic identification technologies. In current RFID systems, RFID data are managed and processed by middleware. In the near future, when RFID technology is applied to large-scale warehouses, airports, or seaports, wireless sensors integrated with an RFID reader will need to form a wireless sensor network because of the difficulties of building wired network...

Business Rule Based Extension of a Semantic Process Modeling Language for Managing Business Process Compliance in the Financial Sector

Managing business process compliance is an important topic in the financial sector. Various scandals and the financial crisis have caused many new constraints and legal regulations that banks and financial institutions have to face. Based on a domain-specific semantic business process modeling notation we propose generic process compliance business rules that serve as a first step towards the i...

Large Scale Distributed Acoustic Modeling With Back-Off N-Grams

The paper revives an older approach to acoustic modeling that borrows from n-gram language modeling in an attempt to scale up both the amount of training data and model size (as measured by the number of parameters in the model), to approximately 100 times larger than current sizes used in automatic speech recognition. In such a data-rich setting, we can expand the phonetic context significantl...

Publication date: 2011